Mastering Web Scraping for Data Collection

Marine Bardou, Aurélien Goutsmedt, and Thomas Laloux

1 Training Goals

Motivations

  • Stimulating ideas for potential research using scraping and/or network analysis
  • Introducing the basic questions and steps in these kinds of analyses
  • Presenting the tools necessary for them (mostly in R)
  • Providing bits of code and practical tips

What we will do

  • Scraping central bankers’ speeches on the Bank for International Settlements website
  • Identifying to which central bank the speaker belongs and where the speech is given
  • Building a network of institutions/events and analyzing this network
  • Building informative visualization of the network

Prerequisites

  • Install R and RStudio
  • These slides are built from a .qmd (quarto) document \(\Rightarrow\) all the code used in these slides can be run in RStudio
# These lines of code have to be run first if you want to install all the packages directly

# pacman will be used to install (if necessary) and load packages
# We install pacman if it is not already installed
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
library(pacman)

# Installing the needed packages in advance
p_load(tidyverse, # basic suite of packages
       glue, # useful for building strings (notably for URLs)
       scico, # color palettes
       patchwork, # for juxtaposition of graphs
       DT) # to display html tables

2 What is Web Scraping?

What is web scraping?

  • Web scraping is a method for extracting data available on the World Wide Web
  • The World Wide Web, or “Web”, is a network of websites (online documents coded in HTML and CSS)
  • A web scraper is a program, for instance in R, that automatically reads the HTML structure of a website and extracts the relevant content (text, hypertext references, tables)
  • Useful when there are many pages to scrape

API vs. web scraping

  • API (Application Programming Interface) provides a structured and predictable way to retrieve data from a service. It’s like ordering from a menu; you request specific data and receive it in a structured format
  • Web Scraping is the process of programmatically extracting data from the web page’s HTML itself. It’s akin to manually copying information from a book; you decide what information you need and how to extract it
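The contrast can be seen with a toy example (the JSON and HTML snippets below are invented for illustration): an API hands you data that is already structured, while scraping means locating the same information inside the page’s HTML.

```r
library(jsonlite) # for parsing JSON, the typical output format of APIs
library(rvest)    # for parsing HTML

# A toy API response: already structured, ready to use
api_response <- '{"speeches": [{"title": "On inflation", "date": "2024-10-09"}]}'
fromJSON(api_response)$speeches$title

# The same information inside a web page: we must locate it in the HTML
html_page <- '<table><tr>
  <td class="item_date">09 Oct 2024</td><td class="title">On inflation</td>
</tr></table>'
read_html(html_page) %>%
  html_element(".title") %>%
  html_text()
```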

API vs. web scraping

  • Control and Structure: APIs offer structured access to data, whereas web scraping requires parsing HTML and often cleaning the data yourself.
  • Ease of Use: Using an API can be simpler since it’s designed for data access. Scraping requires dealing with HTML changes and is more prone to breaking.
  • Availability: Not all websites offer an API, making web scraping a necessity in some cases.
  • Limitations: APIs often have rate limits and may require authentication. Web scraping can bypass these limits but might violate terms of service.

Small data everywhere

  • A wide range of data you can collect:
    • official documents/speeches
    • agendas and meetings
    • lists of personnel or experts in commissions
    • laws or negotiations
  • We can do it “historically” through the Internet Archive
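The Wayback Machine (web.archive.org) serves archived snapshots at a predictable URL pattern, so historical pages can be scraped with the same tools. A minimal sketch (the helper function `wayback_url` is ours, not part of any package):

```r
library(glue)

# The Wayback Machine serves snapshots at the pattern:
# https://web.archive.org/web/{timestamp}/{original_url}
# where timestamp is YYYYMMDDhhmmss; the archive redirects
# to the snapshot closest to the requested timestamp
wayback_url <- function(url, timestamp) {
  glue("https://web.archive.org/web/{timestamp}/{url}")
}

wayback_url("https://www.bis.org/cbspeeches/", "20150101000000")
```

The resulting URL can then be passed to the same scraping functions as a live page.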

3 The Ethics of Web Scraping

Ethical considerations

  • Legal Considerations: Not all data is free to scrape. Websites’ terms of service may explicitly forbid web scraping, and in some jurisdictions, scraping can have legal implications
    • What is “forbidden” by a website is not necessarily “illegal”
  • Privacy Concerns: Scraping personal data can raise significant privacy issues and may be subject to regulations like GDPR in Europe
  • Website Performance: Scraping, especially if aggressive (e.g., making too many requests in a short period), can negatively impact the performance of a website, affecting its usability for others

Ethical practices

  • Respect robots.txt: This file on websites indicates which parts should not be scraped
  • Rate Limiting: Making requests at a reasonable rate to avoid overloading the website’s server
  • User-Agent String: Identifying your scraper can help website owners understand the nature of the traffic
  • Data Use: Consider the ethical implications of how scraped data is used. Ensure it respects the privacy and rights of individuals
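Rate limiting can be as simple as sleeping before every request. A hypothetical sketch (the polite package, used below, automates this together with robots.txt checks; `read_fun` is injectable only so the helper can be tried without hitting the web):

```r
# A minimal rate-limited request helper (a sketch, not a package function)
throttled_get <- function(url, delay = 2, read_fun = rvest::read_html) {
  Sys.sleep(delay) # wait before every request to avoid overloading the server
  read_fun(url)
}
```

In a loop over many pages, the fixed delay guarantees at most one request every `delay` seconds.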

4 How to scrape a website?

The useful packages in R

  • rvest: scraping and cleaning html code
  • polite: responsible web etiquette (informing the website that you are scraping)
  • RSelenium: using a bot to interact with a website
p_load(rvest, # scraping and manipulating html pages
       polite, # scraping ethically
       RSelenium) # scraping by interacting with the website through a browser bot

The Role of Sitemaps

  • Sitemap: a file informing search engines about the URLs on a website that are available for crawling
    • Understand the structure of a website
    • Find where the information we want to extract is located
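A sitemap is just an XML file listing URLs in `<loc>` elements. With a toy fragment (invented URLs; real sitemaps follow the same `<urlset>/<url>/<loc>` layout), extraction is a one-liner with rvest:

```r
library(rvest)

# A toy sitemap fragment with invented URLs
sitemap <- '<urlset>
  <url><loc>https://www.example.org/speeches/2024.htm</loc></url>
  <url><loc>https://www.example.org/speeches/2023.htm</loc></url>
</urlset>'

read_html(sitemap) %>%
  html_elements(xpath = ".//loc") %>%
  html_text()
```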

Being respectful of the website

Declaring yourself:

session <- polite::bow(bis_website_path, 
                       user_agent = "polite R package - used for academic training by Aurélien Goutsmedt (aurelien.goutsmedt[at]uclouvain.be)")
cat(session$robotstxt$text)
#Format is:
#       User-agent: <name of spider>
#       Disallow: <nothing> | <path>
#-------------------------------------------

User-Agent: *
Disallow: /dcms
Disallow: /metrics/
Disallow: /search/
Disallow: /staff.htm
Disallow: /embargo/
Disallow: /app/
Disallow: /goto.htm
Disallow: /login
#Disallow: /cbhub
Disallow: /cbhub/goto.htm
Disallow: /doclist/
# Committee comment letters
Disallow: /publ/bcbs*/
Disallow: /bcbs/ca/
Disallow: /bcbs/commentletters/
Disallow: /*/publ/comments/
# Hide the Basel Framework standards, only chapters should be indexed.
Disallow: /basel_framework/standard/

Sitemap: https://www.bis.org/sitemap.xml
session$robotstxt$sitemap
    field useragent                           value
1 Sitemap         * https://www.bis.org/sitemap.xml

Using sitemap

Code
# This function goes to a sitemap page and extracts all the URLs found
extract_url_from_sitemap <- function(url, delay = 1) { 
  urls <- read_html(url) %>% 
    html_elements(xpath = ".//loc") %>% 
    html_text()
  Sys.sleep(delay) # You set a delay to avoid overloading the website
  return(urls)
}

# insistently() allows retrying when a page fails to load
insistently_extract_url <- insistently(extract_url_from_sitemap, rate = rate_backoff(max_times = 5)) 

document_pages <- extract_url_from_sitemap(session$robotstxt$sitemap$value) %>% 
  .[str_detect(., "documents")] # We keep only the URLs for documents

bis_pages <- map(document_pages[1:5], # showing the code just on the first five years
                 ~insistently_extract_url(url = ., 
                                          delay = session$delay))

bis_pages <- tibble(year = str_extract(document_pages[1:5], "\\d{4}"),
                    urls = bis_pages) %>% 
  unnest(urls)

Scraping a BIS speech with rvest

“https://www.bis.org/review/r241010a.htm”

Scraping BIS: understanding URLs

The second page of our query:

https://www.bis.org/cbspeeches/index.htm?fromDate=01%2F01%2F2023&cbspeeches_page=2&cbspeeches_page_length=25

page <- 2
day <- "01"
month <- "10"
year <- 2024
url_second_page <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page={page}&cbspeeches_page_length=25")

Scraping one page: using a scrape helper

  • Scraping add-ons for your browser help you navigate through the elements of a webpage
    • XPath is the path to a specific part of a webpage
    • CSS selectors are primarily meant for styling web pages, but they also allow matching an element’s position within the HTML structure
  • Typical scraping helpers: ScrapeMate and SelectorGadget
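Both kinds of paths target the same elements. With a toy HTML fragment (invented content, mimicking the `.item_date` cells scraped below), rvest accepts either a CSS selector or an XPath expression:

```r
library(rvest)

# Toy HTML: one table cell, targeted two ways
page <- read_html('<table><tr><td class="item_date">09 Oct 2024</td></tr></table>')

page %>% html_element(css = ".item_date") %>% html_text()
page %>% html_element(xpath = '//td[@class="item_date"]') %>% html_text()
# both return the same text
```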

Scraping one page: mixing rvest and RSelenium

# Launch Selenium to go on the website of bis
driver <- rsDriver(browser = "firefox", # can also be "chrome"
                   chromever = NULL,
                   port = 4444L) 
remote_driver <- driver[["client"]]

Scraping one page: mixing rvest and RSelenium

remote_driver$navigate(url_second_page)
Sys.sleep(session$delay)


element <- remote_driver$findElement("css selector", ".item_date")
element$getElementText()[[1]]
[1] "09 Oct 2024"


elements <- remote_driver$findElements("css selector", ".item_date")
length(elements)
[1] 25
elements[[25]]$getElementText()[[1]]
[1] "02 Oct 2024"

Scraping one page

Code
data_page <- tibble(date = remote_driver$findElements("css selector", ".item_date") %>% 
                      map_chr(., ~.$getElementText()[[1]]),
                    info = remote_driver$findElements("css selector", ".item_date+ td") %>% 
                      map_chr(., ~.$getElementText()[[1]]),
                    url = remote_driver$findElements("css selector", ".dark") %>% 
                      map_chr(., ~.$getElementAttribute("href")[[1]])) %>% 
  separate(info, c("title", "description", "speaker"), "\n")
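The final separate() call splits the single scraped info string on line breaks into three columns. With a toy row (invented values) it behaves like this:

```r
library(tibble)
library(tidyr)

# Toy version of the scraped "info" column: title, description, and
# speaker are stacked in one string, separated by newlines
toy <- tibble(info = "A speech title\nA short description\nJane Doe")
separate(toy, info, into = c("title", "description", "speaker"), sep = "\n")
```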

Scraping all the pages

starting_url <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page=1&cbspeeches_page_length=25")
remote_driver$navigate(starting_url)
# Extract the total number of pages
nb_pages <- remote_driver$findElement("css selector", ".pageof")$getElementText()[[1]] %>%
  str_remove_all("Page 1 of ") %>%
  as.integer()

# creating a list object to progressively store the information
metadata <- vector(mode = "list", length = nb_pages)

for(page in 1:nb_pages){
  url <- glue("https://www.bis.org/cbspeeches/index.htm?fromDate={day}%2F{month}%2F{year}&cbspeeches_page={page}&cbspeeches_page_length=25")
  remote_driver$navigate(url)
  nod <- nod(session, url) # introducing politely to the new page
  Sys.sleep(session$delay) # using the delay time set by polite

  metadata[[page]] <- tibble(date = remote_driver$findElements("css selector", ".item_date") %>% 
                            map_chr(., ~.$getElementText()[[1]]),
                          info = remote_driver$findElements("css selector", ".item_date+ td") %>% 
                            map_chr(., ~.$getElementText()[[1]]),
                          url = remote_driver$findElements("css selector", ".dark") %>% 
                            map_chr(., ~.$getElementAttribute("href")[[1]])) 
}

metadata <- bind_rows(metadata) %>% 
  separate(info, c("title", "description", "speaker"), "\n")
driver$server$stop() # we close the bot once we've finished

5 Questions

6 Useful Resources